1. Data Processing

Reading the CSV file and storing the data frame in covidtable:

covidtable = read.csv("C:/Users/Creacion Tech/Documents/covid_data.csv")

The line below shows the structure of the first 10 columns of the data frame.

str(covidtable[,c(1:10)])
## 'data.frame':    126 obs. of  10 variables:
##  $ Sample  : chr  "C1" "C2" "C3" "C4" ...
##  $ Age     : int  39 63 33 49 49 40 38 78 64 62 ...
##  $ Sex     : chr  "male" "male" "male" "male" ...
##  $ Severity: chr  "NonICU" "NonICU" "NonICU" "NonICU" ...
##  $ A1BG    : num  0.49 0.29 0.26 0.45 0.17 0.21 0.49 0.12 0.51 0.1 ...
##  $ A1CF    : num  0 0 0 0.01 0 0 0.01 0 0.01 0 ...
##  $ A2M     : num  0.21 0.14 0.03 0.09 0 0.08 0.23 0.08 0.88 0.13 ...
##  $ A2ML1   : num  0.04 0 0.02 0.07 0.05 0.04 0.03 0.01 0.02 0.01 ...
##  $ A3GALT2 : num  0.07 0 0 0 0.07 0 0.07 0 0.79 0.15 ...
##  $ A4GALT  : num  0 0 0 0 0 0 0 0 0 0 ...

The output shows that all the gene-expression columns are numeric, while Sample, Sex, and Severity are categorical (character) variables.

We are predicting Severity from the gene-expression values only, so the other variables (Sample, Age, and Sex) are removed from the data set.

covidtable = covidtable[,-c(1:3)] 

Visualizing the frequency distribution of the ICU/NonICU target variable.

library(plotly)  # plot_ly() and the %>% pipe come from here

fig <- plot_ly(x = covidtable$Severity, type = "histogram", color = covidtable$Severity)%>%
  layout(title = 'Frequency of each class')

fig

Looking at the proportion of each class makes the distribution of Severity values clearer.

table(covidtable$Severity) / nrow(covidtable)
## 
##       ICU    NonICU 
## 0.5238095 0.4761905

At roughly 52% ICU to 48% NonICU, the classes are well balanced.

Recoding Severity: “ICU” becomes 1 and “NonICU” becomes 0.

covidtable$Severity[covidtable$Severity == "ICU"] = 1

covidtable$Severity[covidtable$Severity == "NonICU"] = 0

Changing the data type of Severity from character to factor.

covidtable$Severity = as.factor(covidtable$Severity)
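The two assignments above can also be written as one vectorized call; a minimal sketch on toy values (not the actual data), producing the same 0/1 factor:

```r
# Vectorized recoding: "ICU" -> 1, "NonICU" -> 0, then convert to a factor
severity <- c("ICU", "NonICU", "ICU", "NonICU")              # toy example values
recoded  <- factor(ifelse(severity == "ICU", 1, 0), levels = c(0, 1))
as.character(recoded)                                        # "1" "0" "1" "0"
```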

Normalizing the gene-expression values

Min-max normalization rescales the numeric columns of the data set to a common [0, 1] scale, without distorting differences in the ranges of values or losing information.

# Min-max normalization: rescales a numeric vector to [0, 1]
Normalize_Data <- function(val) { return ((val - min(val)) / (max(val) - min(val))) }

# Column 1 is Severity; normalize every gene-expression column
for (col in 2:ncol(covidtable)) { 
  covidtable[,col] = Normalize_Data(covidtable[,col])
}

Genes with constant expression across all samples divide by zero during normalization and become NaN, so those columns are removed:

covidtable = covidtable[ , colSums(is.na(covidtable)) == 0]
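To see why NAs appear, consider a toy data frame (hypothetical values) with one varying gene and one constant gene: min-max normalization divides by max(val) - min(val), which is zero for the constant gene, and the colSums(is.na(...)) filter then drops that column.

```r
Normalize_Data <- function(val) { (val - min(val)) / (max(val) - min(val)) }

toy <- data.frame(g1 = c(0.2, 0.5, 0.8),   # varying gene: maps to 0, 0.5, 1
                  g2 = c(0.3, 0.3, 0.3))   # constant gene: 0/0 = NaN everywhere
toy[] <- lapply(toy, Normalize_Data)
toy <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]   # g2 is dropped
names(toy)                                             # "g1"
```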

Analysing the independent variables

ncol(covidtable[,-c(1)])
## [1] 18318

There are 18318 predictor columns, so we need to select only those that have some relationship with the dependent variable.

Feature selection using Boruta

The main data frame is split into covidtable1 and covidtable2 so that the feature-selection function does not exhaust the computer's memory.

covidtable1 = covidtable[,c(1,2:10000)]

covidtable2 = covidtable[,c(1,10001:ncol(covidtable))]

Boruta() and getSelectedAttributes() are applied to covidtable1 and covidtable2 separately, and the features returned (including tentative ones, since withTentative = TRUE) are stored in boruta_signif1 and boruta_signif2 respectively.

library(Boruta)  # Boruta() and getSelectedAttributes() come from here

boruta_output1 <- Boruta(Severity ~ ., data=na.omit(covidtable1), doTrace=0) 

boruta_signif1 <- getSelectedAttributes(boruta_output1, withTentative = TRUE)

boruta_output2 <- Boruta(Severity ~ ., data=na.omit(covidtable2), doTrace=0) 

boruta_signif2 <- getSelectedAttributes(boruta_output2, withTentative = TRUE)

The lists boruta_signif1 and boruta_signif2 are then combined:

feature_selected = c(boruta_signif1,boruta_signif2)

The data frame was filtered so that only the features with some relationship to the dependent variable were kept.

covidtable = covidtable[,c("Severity",feature_selected)]

Boruta() and getSelectedAttributes() were applied to the reduced data again; this time withTentative was set to FALSE, meaning only the features confirmed as important for “Severity” were kept.

boruta_output <- Boruta(Severity ~ ., data=na.omit(covidtable), doTrace=0)

boruta_signif <- getSelectedAttributes(boruta_output, withTentative = FALSE)

The data frame was filtered again, this time keeping only the features confirmed as important for the dependent variable.

covidtable = covidtable[,c("Severity",boruta_signif)]

The list of the features selected was saved as a txt file.

write.csv(names(covidtable),"featuresSelected.txt", row.names = FALSE)

The list is read from the local directory and loaded into R. However, the model still failed to converge with all 27 shortlisted variables, so only the first 11 features were kept. (The first 12 entries of the list are taken because entry 1 is “Severity” itself.)

featuresSelected = read.csv("featuresSelected.txt")

featuresSelected = unlist(featuresSelected)

featuresSelected = featuresSelected[1:12]

List of the selected features

featuresSelected
##         x1         x2         x3         x4         x5         x6         x7 
## "Severity"     "ARG1"     "CCL3"     "CCL5"    "CDHR2"   "CYB561"  "CYSLTR1" 
##         x8         x9        x10        x11        x12 
##     "EGR2"   "FCER1A"   "GALNT6"    "GPR68"    "GRB10"

Filtering the data frame using the selected features list

covidtable = covidtable[,c(featuresSelected)]

Now our data is clean.

2. Training and Tuning

Test Train division

The seed is set so that the random numbers generated each time are the same; this helps in evaluating the model because the train and test sets remain identical across runs. The sample size is set to 80% of the total observations and random indexes are generated. The observations with those indexes are assigned to the train set, while the others are assigned to the test set.

set.seed(222) 

sample_size = round(nrow(covidtable)*.80) 

index <- sample(seq_len(nrow(covidtable)), size = sample_size)

train <-covidtable[index, ]

test <- covidtable[-index,]

k-fold cross-validation

library(caret)  # trainControl(), train(), and confusionMatrix() come from here

train.control <- trainControl(method = "cv", number = 5)
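Under 5-fold CV, caret partitions the 101 training rows into five roughly equal folds; each fold is held out once while the model is fit on the other four. A hand-rolled sketch of the partitioning (caret's own createFolds additionally stratifies by class):

```r
set.seed(222)
n <- 101                                             # rows in the training set
folds <- split(sample(seq_len(n)), rep(1:5, length.out = n))
sapply(folds, length)                                # 21 20 20 20 20
```

Fold sizes of 21 and 20 leave 80 or 81 rows for fitting, matching the resample sizes reported in the model summary.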

Training the model

A logistic regression model (caret method "glm", which fits a binomial GLM for a two-class factor outcome) is used for training.

model <- train(Severity ~., data = train, method = "glm",
               trControl = train.control)

3. Model Validation

model
## Generalized Linear Model 
## 
## 101 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 81, 81, 81, 81, 80 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8019048  0.6021277

The model achieved about 80% accuracy on the training data, which is a good result given that the train set holds just 101 observations; with 5-fold cross-validation, each resample trains on 80% of those and validates on the remaining 20%.

4. Model Interpretation

summary(model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.81671  -0.37507   0.00191   0.32658   2.01942  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   1.6764     1.5480   1.083   0.2788  
## ARG1          1.9524     4.8114   0.406   0.6849  
## CCL3         -5.1250     2.2407  -2.287   0.0222 *
## CCL5          4.0904     5.0386   0.812   0.4169  
## CDHR2         8.8415     5.9049   1.497   0.1343  
## CYB561       -5.3571     2.8492  -1.880   0.0601 .
## CYSLTR1      -4.1886     2.3155  -1.809   0.0705 .
## EGR2          0.2514     2.4488   0.103   0.9182  
## FCER1A        4.1301     2.7033   1.528   0.1266  
## GALNT6        0.6902     3.2337   0.213   0.8310  
## GPR68        -6.8297     5.9687  -1.144   0.2525  
## GRB10         6.1498     4.4722   1.375   0.1691  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 139.927  on 100  degrees of freedom
## Residual deviance:  56.569  on  89  degrees of freedom
## AIC: 80.569
## 
## Number of Fisher Scoring iterations: 7

Coefficients

Above is the summary of our logistic regression model. For each one-unit increase in the normalized expression of “ARG1”, the log-odds of Severity (“ICU” versus “NonICU”) increase by 1.9524.

For each one-unit increase in “CDHR2”, the log-odds of Severity increase by 8.8415.

The other variables can be interpreted in the same fashion; note that only “CCL3” is significant at the 5% level.
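The log-odds estimates are often easier to read as odds ratios via exp(); a short sketch using three of the estimates from the summary above:

```r
# Convert fitted log-odds coefficients to odds ratios
coefs <- c(ARG1 = 1.9524, CCL3 = -5.1250, CDHR2 = 8.8415)
odds_ratios <- exp(coefs)
odds_ratios["ARG1"] > 1    # TRUE: higher ARG1 raises the odds of ICU
odds_ratios["CCL3"] < 1    # TRUE: higher CCL3 lowers the odds of ICU
```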

Null Deviance and Residual Deviance

The difference between the two tells us whether the model is a good fit: the greater the difference, the better the model. Null deviance is the value when the equation contains only the intercept, and residual deviance is the value when all the variables are taken into account, so a large drop indicates the predictors add real explanatory power. In our case, “Null deviance: 139.927” versus “Residual deviance: 56.569” shows that our model is a good fit.
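The drop in deviance can be tested formally with a chi-squared likelihood-ratio test; the degrees of freedom equal the difference in residual degrees of freedom reported in the summary (100 - 89 = 11):

```r
# Likelihood-ratio test using the deviances reported in the summary above
null_dev  <- 139.927; null_df  <- 100
resid_dev <- 56.569;  resid_df <- 89
p_value <- pchisq(null_dev - resid_dev, df = null_df - resid_df, lower.tail = FALSE)
p_value < 0.05    # TRUE: the predictors jointly improve the fit
```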

Fisher Scoring Iterations

This is the number of iterations needed to fit the model. Logistic regression uses an iterative maximum-likelihood algorithm, and Fisher scoring is equivalent to iteratively reweighted least squares (IRLS). “Fisher Scoring iterations: 7” means the model converged in 7 iterations.
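For illustration, Fisher scoring / IRLS can be sketched in a few lines; this toy implementation on simulated data (not the author's model) reproduces glm() coefficients:

```r
# Minimal IRLS for logistic regression; each step solves a weighted least squares
irls_logistic <- function(X, y, iters = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    eta  <- drop(X %*% beta)
    mu   <- 1 / (1 + exp(-eta))     # fitted probabilities
    w    <- mu * (1 - mu)           # IRLS weights
    z    <- eta + (y - mu) / w      # working response
    beta <- solve(t(X) %*% (X * w), t(X) %*% (w * z))
  }
  drop(beta)
}

set.seed(1)
X <- cbind(1, rnorm(50))                      # intercept + one predictor
y <- rbinom(50, 1, plogis(0.5 + X[, 2]))      # simulated binary outcome
irls_logistic(X, y)                           # matches coef(glm(y ~ X[, 2], family = binomial))
```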

5. Predictions

The steps below first predict Severity on the test data set using the model above. Those predictions are then compared to the actual values, and the overall accuracy of the prediction is calculated at the end.

test$Predicted = predict(model,test[,c(2:ncol(test))])

Error <- mean(test$Predicted != test$Severity)

print(paste('Accuracy',round((1-Error)*100,2), "%" ) )
## [1] "Accuracy 88 %"

Confusion Matrix

A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy alone can be misleading; calculating a confusion matrix gives a better idea of what the classification model is getting right and what types of errors it is making.

test$Predicted = as.factor(test$Predicted)

CM <- confusionMatrix(data=test$Predicted, reference = test$Severity)

CM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  9  1
##          1  2 13
##                                           
##                Accuracy : 0.88            
##                  95% CI : (0.6878, 0.9745)
##     No Information Rate : 0.56            
##     P-Value [Acc > NIR] : 0.0006695       
##                                           
##                   Kappa : 0.7541          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.8182          
##             Specificity : 0.9286          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.8667          
##              Prevalence : 0.4400          
##          Detection Rate : 0.3600          
##    Detection Prevalence : 0.4000          
##       Balanced Accuracy : 0.8734          
##                                           
##        'Positive' Class : 0               
## 

The confusion matrix above shows “Pos Pred Value : 0.9000” and “Neg Pred Value : 0.8667”. Since the positive class here is “0” (NonICU), the model's positive predictions are correct 90% of the time and its negative predictions 87% of the time. Hence our model isn't biased and predicts both classes with high accuracy.
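The reported statistics can be reproduced directly from the four counts in the matrix above (with “0” as the positive class):

```r
# Counts from the confusion matrix: rows = prediction, columns = reference
TP <- 9;  FP <- 1    # predicted 0: reference 0 / reference 1
FN <- 2;  TN <- 13   # predicted 1: reference 0 / reference 1
accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # 22/25 = 0.88
sensitivity <- TP / (TP + FN)                    # 9/11  ~ 0.8182
specificity <- TN / (TN + FP)                    # 13/14 ~ 0.9286
ppv         <- TP / (TP + FP)                    # 9/10  = 0.9000
npv         <- TN / (TN + FN)                    # 13/15 ~ 0.8667
```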

Underfitting/Overfitting

The model gave 80% accuracy on the train data set and 88% on the test data set. This shows the model is neither under-fitted nor over-fitted. Moreover, the performance of the model could be improved if we had a larger number of observations.

6. Conclusion

We predicted the Severity of a patient using his/her gene-expression profile. The whole process consisted of data cleaning, training the model, and assessing its performance. We started with over 18000 features, but after several feature-selection steps we were left with just 11. This reduction allowed the model to converge and give reasonable accuracy on both the train and test data sets. The overall performance could be improved further with more than the current 126 observations. In a nutshell, we were able to predict the severity of a patient with 88% accuracy, which is a solid result.